Abstract


In this thesis, Large Grain Data Flow (LGDF) representation of parallelism is
applied to throughput-critical applications that process periodically arriving data. The
applications are represented by directed acyclic graphs in which a vertex represents an
indivisible node program execution and an arc represents data flow from its source node to
its sink node. The machine and graph parameters are assumed to be such that the time to
transfer one unit of data is comparable to the time to execute one operation at a processor.
The machine model consists of a set of processors connected to a set of memory modules
by a crossbar interconnection network. Execution of LGDF graphs on such machines
requires either a run-time mechanism to dispatch executable nodes on available processors
or a compile-time static scheduling of nodes to processors. The former approach, although
flexible and robust, suffers from contention-related overhead; the latter, although
capable of eliminating contention, is rigid and computationally intensive.

It is shown by simulation that throughput can be improved when compile-time
graph restructuring is coupled with simple first-come-first-serve dispatching. The
restructuring is based on selectively adding control dependencies between graph nodes.
This technique, called the revolving cylinder analysis, is shown to be an effective
framework for achieving communication / computation overlap and reducing memory
contention.

I. INTRODUCTION


The modern military depends on real-time digital signal processing applications (such as radar and
sonar). These applications generate huge amounts of data continuously. Most of the data is of a time-critical
nature and must be processed quickly and accurately. Advances in computer technology have made it much
easier to analyze this data. However, the signal processing applications are constantly improving also,
generating even more data more quickly.

Large Grain Data Flow (LGDF) graphs can be used to represent these applications. Data flow graphs
not only describe the dependencies between different parts of the computation required in an application, but
also provide built-in scheduling and synchronization. For example, on a hypothetical system with no
communication cost and an unlimited number of processors, nodes can synchronize by sending data and a
node can be scheduled as soon as all the required data is present at its input. Due to the generality of this
representation, it can be used to specify parallelism at the instruction level [Ref. 1] as well as at the task level
[Ref. 2]. The theoretical foundation for the consistency of such representations has been well studied [Ref. 3,
Ref. 4].

In practical implementations of this paradigm, the machine must provide mechanisms to manage the
data that flows through the graph and to capture the intrinsic scheduling and synchronization. These
mechanisms, typically operating at run-time, result in overhead that leads to suboptimal performance. The
amount of overhead depends critically on the granularity of the parallelism expressed by the graph and on
whether the computations have conditionals and recursion. A direct implementation in hardware of the data
flow paradigm for general applications results in unmanageable overhead [Ref. 1, Ref. 5].

Any data flow implementation must handle the buffering and fetching of data, the allocation of graph nodes
to processors, their ordering on each processor, and the exact times at which they are scheduled. If all the related
decisions are made at run-time, the efficiency of the implementation suffers. The overhead can be reduced
effectively by using the node and arc attributes of the data flow graph at compile-time to simplify the
run-time management. Based on which decisions are made at compile-time and which ones are made at run-time,
data flow implementations can be classified over a spectrum that ranges from fully static to fully dynamic
[Ref. 6]. While dynamic implementations have more overhead, they are more flexible and are easier to
implement. They also degrade gracefully in the event of individual processor malfunction. On the other hand,
static implementations are more efficient and lead to predictable performance, which is crucial to real-time
systems. However, they are difficult to realize, are inflexible, and do not degrade gracefully. Their
effectiveness is determined by how accurately the computational requirements of the application are known.
This is typically a difficult problem, and the usual solution of using the worst-case estimate can result in large
inefficiencies.

Therefore, real-time systems must strike a careful balance between the compile-time effort and run-time
complexity to get the desired and guaranteed performance. For classes of applications such as signal
processing, such a balance can be obtained by exploiting two properties of the computations required: the
availability of a priori knowledge of the amount of data produced and consumed, and negligible use of
conditionals and recursion. When the amounts of data produced and consumed by the nodes of a data flow
graph are known exactly, the applications are called synchronous data flow applications [Ref. 2]. When the
data arrives periodically, they have been classified as pipelined function-parallel computations [Ref. 7]. In
real-time signal processing applications, the trade-off between compile-time and run-time has an additional
dimension because of the periodic arrival of data. When external data arrives periodically, the intrinsic
non-determinism of data flow execution results in unpredictable program behavior. As a result, processed data
arrives unpredictably, leading to the possibility of intolerable delays and insufficient buffer space, especially
under high loads.

A.  THESIS  SCOPE  AND  CONTRIBUTION 

The focus of this work is on compile-time mechanisms for controlling data flow execution. The technique
studied, called revolving cylinder (RC) analysis, was originally introduced in [Ref. 8]. Instead of generating
information, such as schedules, to control allocation or ordering on processors at run-time, a new data flow
graph is obtained at compile-time which gives a better throughput and behaves more predictably than the old
graph under the same run-time mechanism. The key idea in restructuring based on RC analysis is that
inserting dependencies in the graph can produce a graph with better performance. This idea can be traced back
to algorithms for overlapping complex operations on pipelined processors [Ref. 9]. This restructuring
selectively changes the conditions under which a node will enter the list of executable nodes; however, choosing the
processor to schedule it on is left to the run-time dispatcher. This enables the actual scheduling to remain
dynamic, keeping the run-time overhead low.

This thesis defines a model for a Large Grain Data Flow system, which is loosely based on the
Department of the Navy AN/UYS-2 Digital Signal Processing System (also known as the Enhanced Modular
Signal Processor, EMSP) [Ref. 10]. Baseline results will be generated to show that it is possible to improve
the system throughput over that offered by first-come-first-serve (FCFS) scheduling by compile-time
restructuring of the LGDF programs following the RC technique. The utility of several computer programs
designed to analyze this LGDF model and FCFS and RC scheduling will be verified with the generation of
the results.

B.  THESIS  ORGANIZATION 

Chapter II fully describes the LGDF system model. Included are descriptions of the hardware and
software, along with the joint hardware/software view. Chapter III is a description of the FCFS and RC
scheduling techniques. Chapter IV is an analysis of the data generated for the LGDF model using all the
scheduling techniques presented. Chapter V summarizes the results and presents possible topics for future
study.

C.  ADDITIONAL  RESEARCH 

Additional results and further analysis of the concepts in this thesis are included in [Ref. 11]. The
computer programs used to generate the results in this thesis are described in detail with complete examples
and program listings in [Ref. 12].

II. THE LARGE GRAIN DATA FLOW MODEL

A Large Grain Data Flow (referred to as LGDF) computer system can be defined in terms of the two
major categories which are used to define most computer systems, hardware and software.

A.  SOFTWARE  MODEL 

The software model of a data flow system is usually visualized as a graph. There are two primary
elements to this data flow graph, nodes and queues. There are five secondary elements to the graph: system
input nodes, system output nodes, system input queues, system output queues, and synchronization arcs.
These secondary elements are necessary for the computer program which models this system. Figure 2.1 is a
simple data flow graph example showing the graph symbols. Note that there are no special symbols for system
input and output queues; they are determined by their attachment to the system input and output nodes.


Figure 2.1. Data Flow Graph Example

1. Terms

Several important terms are defined here.

a.  Cycle 

The term ‘cycle’ is used to describe an arbitrary time unit. It could represent any unit of time,
but is usually interpreted as a microsecond.

b.  Word 

The term ‘word’ is used to describe an arbitrary data element. In the model, it could represent
any unit of data size, but is usually interpreted as a byte.

c. Processing

‘Processing’ refers to all activities performed by a node on a processor. This includes actual
node execution, the transfer of information between the processor and memory (both instruction and data),
and any latency.

d.  Execution 

‘Execution’ refers only to the actual execution of the node on a processor to accomplish a
given task. It does not include any memory operations or latency involved with those operations.

e. Input and Output

The terms ‘input’ and ‘output’ are used in many varied contexts. To eliminate confusion,
the beginning and end points of the graph are referred to as ‘system’ inputs and outputs.

2.  Nodes 

Nodes represent software modules which perform a specific function. This module could be a
program, a subroutine, or a function. What is inside the node is not important for modeling the LGDF system.
The model is only concerned with the length of time it will take the node to complete its given operation and
the amount of data input into the node and output from the node.

In  this  model,  a  node  is  characterized  by  several  parameters. 

a.  Execution  Time 

The execution time (in cycles) is the time required by the node to complete its function once
the data and the node instructions have been loaded onto a specific processor.




b. Setup Time

The setup time (in cycles) represents a constant latency before a node is able to access any
memory modules after being assigned to a processor.

c. Breakdown Time

The breakdown time (in cycles) represents a constant latency for the node that has completed
memory operations before the processor is made available in the free processor pool.

d. Instruction Size

The instruction size is listed in words. The instruction size is used to determine how long it
will take to load the code segment represented by the node to a processor for execution. This time is dependent
on the data transfer rate of the hardware.

e. Processor Type

The  processor  type  is  used  to  specify  nodes  which  must  use  a  specific  type  of  processor. 
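
As a minimal sketch, the five node parameters above can be collected in a small record. The Python class below is illustrative only: the class name, field names, the default processor-type label, and the sample values are assumptions, not part of the thesis software. The helper multiplies the instruction size by a per-word transfer time to give the load time when no memory contention occurs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One LGDF node, described only by its timing and size parameters."""
    name: str
    execution_time: int    # cycles spent on the execution unit
    setup_time: int        # constant latency (cycles) before any memory access
    breakdown_time: int    # constant latency (cycles) before the processor is freed
    instruction_size: int  # words of code to load onto the processor
    processor_type: str = "arithmetic"  # hypothetical label for the required type

    def contention_free_load_time(self, comm_time: int) -> int:
        """Cycles needed to load the code segment, ignoring memory delays."""
        return self.instruction_size * comm_time


# Hypothetical node: 120 words of code, 2 cycles per word transferred.
fft = Node("fft", execution_time=400, setup_time=5,
           breakdown_time=5, instruction_size=120)
load_cycles = fft.contention_free_load_time(comm_time=2)
```

What the node computes is deliberately absent from the record, which mirrors the statement that the contents of a node are not important to the model.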

3.  Queues 

Queues are used to represent the flow of data. Each queue connects a pair of nodes, and the amount
of data transferred between the nodes is identified. Data is transferred from the node at the tail of the queue
(named the source node) to the node at the head of the queue (named the sink node).

In  this  model,  a  queue  is  characterized  by  several  parameters. 

a.  Threshold  Amount 

The threshold amount is the amount of data (in words) required to be on the queue for the
sink node to begin execution.

b. Produce Amount

The produce amount is the amount of data (in words) added to the queue upon completion of
one execution instance of the source node.

c. Consume Amount

The consume amount is the amount of data (in words) removed from the queue upon the start
of one execution instance of the sink node.

d. Write Amount

The write amount is the amount of data (in words) written from the source node to memory.

e. Read Amount

The read amount is the amount of data (in words) read by the sink node from memory prior
to the beginning of one execution instance.

f. Capacity

The capacity is the total amount of data (in words) which can be stored on the queue. If the
capacity of the queue would be exceeded, a source node cannot produce any more data until the sink node
consumes data to open space on the queue.

g.  Initial  Length 

The initial length is the amount of data (in words) placed on the queue at system start-up.

h.  Relationship  among  the  Parameters 

There are several important distinctions to be made between the parameters. It would appear
that the produce and write amounts are equivalent and the consume and read amounts are equivalent. For most
data queues, the produce and write amounts would be the same quantity, as would the consume and read amounts.
However, the functions performed are distinctly different. The read and write amounts represent actual data
transfers required between a processor and memory. These transfers require a large amount of time to
complete. The produce and consume amounts represent a control function within the scheduler. No data is
actually transferred, but the queue length recorded by the scheduler is adjusted. The difference will become
more obvious when synchronization arcs are discussed.

There is one major requirement to be met by the parameters. This requirement is that the
capacity of the queue must be greater than or equal to the threshold. If this is not the case, then there could
never be enough data on the queue to cause the sink node to trigger.

For most data queues, the threshold and consume amounts will be the same. This means that
the sink node requires a set amount of data to trigger. When this threshold is reached, the sink node will
consume that much data in one execution instance.

In many cases, the produce amount will also be the same as the threshold and consume
amounts. This represents a linear program. The source node produces the exact amount of data which is
required and used by the sink node. However, this is not always the case. If the produce amount is less than
the threshold, then the source node must execute multiple times before triggering the sink node. If the produce
amount is greater than the consume amount, the sink node may execute multiple times upon completion of
the source node.

Figure 2.2 is a graphical representation of the queue parameters.


Figure  2.2.  Graphical  Description  of  Queue  Parameters 
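
The relationships among the queue parameters can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the thesis software: the class and function names are hypothetical, and a queue whose length is at or above the threshold is assumed to trigger its sink. The constructor enforces the requirement that capacity be at least the threshold, and the helper counts how many source completions are needed before the sink can first trigger.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    """Scheduler-side view of one LGDF data queue (all amounts in words)."""
    threshold: int   # data required on the queue before the sink may trigger
    produce: int     # added to the queue length when the source completes
    consume: int     # removed from the queue length when the sink starts
    write: int       # words physically written to memory by the source
    read: int        # words physically read from memory by the sink
    capacity: int    # maximum queue length
    length: int = 0  # initial length at system start-up

    def __post_init__(self) -> None:
        if self.capacity < self.threshold:
            raise ValueError("capacity must be >= threshold or the sink never triggers")

    def sink_can_trigger(self) -> bool:
        # Assumption: "at or above threshold" is enough to trigger.
        return self.length >= self.threshold

    def source_can_produce(self) -> bool:
        return self.length + self.produce <= self.capacity


def source_firings_until_trigger(q: Queue) -> int:
    """Count source completions needed before the sink can first trigger."""
    if q.produce <= 0:
        raise ValueError("produce amount must be positive")
    firings = 0
    while not q.sink_can_trigger():
        if not q.source_can_produce():
            raise RuntimeError("queue would overflow before reaching threshold")
        q.length += q.produce  # control bookkeeping only; no data is moved here
        firings += 1
    return firings


# Produce amount (2) below threshold (6): the source must run several times.
q = Queue(threshold=6, produce=2, consume=6, write=2, read=6, capacity=12)
firings = source_firings_until_trigger(q)
```

Only `produce`, `consume`, `threshold`, and `capacity` take part in this bookkeeping. The `read` and `write` amounts matter only for transfer time, which is the distinction drawn above.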

4. System Input Nodes and System Input Queues

System input nodes are necessary to simulate multiple execution instances of the graph. Upon
initiation of a graph instance, this node is activated. System input nodes have the same parameters as nodes
as defined above. However, system input nodes will operate on a special input/output processor. The system
input node is the sink node of an external queue. This external queue does not really exist, but functions as a
queue with infinite capacity and threshold and consume amounts of one data unit. When the graph instance
is initiated, one data unit is produced onto this queue. The output queues from the system input nodes are
designated system input queues. They function exactly as the data queues described above. However, the data
written to them comes from an I/O processor.

5. System Output Nodes and System Output Queues

System output nodes are necessary to simulate multiple execution instances of the graph. Once all
the queues into this node have exceeded threshold, this node is activated. System output nodes have the same
parameters as nodes as defined above. However, system output nodes will operate on a special input/output
processor. The system output node is the source node of an inherent queue. This inherent queue does not
really exist, but functions as a queue with infinite capacity. As this system output node is executed, it can be
assumed that all input queues to this node transfer data equal to the consume amount to the outside as data
output. The input queues to the system output nodes are designated system output queues. They function
exactly as the data queues described above. However, the data read from them is read by an I/O processor.

6.  Synchronization  Arcs 

Synchronization arcs are a special subclass of the queues described above. However, they function
slightly differently. They represent control signals which will be described later. Due to the control nature of
these arcs, the produce and consume amounts are generally one, representing a counter. However, the read
and write amounts will always be zero. This is because the synchronization arcs reside only in the scheduler
memory, and no data is actually ever transferred to a processor. The threshold and initial length amounts are
highly variable depending upon the RC analysis and are used to trigger nodes in a specific order.
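
A short sketch may help show how a synchronization arc acts as a counter that orders two otherwise independent nodes. The Python below is illustrative only: the names `SyncArc` and `sink_starts_allowed` are hypothetical, the sink is assumed to start greedily whenever the arc is at or above threshold, and the sample values do not come from any RC analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncArc:
    """A control-only queue: it is counted by the scheduler, never transferred."""
    threshold: int
    initial_length: int = 0
    produce: int = 1  # generally one, acting as a counter
    consume: int = 1  # generally one, acting as a counter
    read: int = 0     # always zero for a synchronization arc
    write: int = 0    # always zero for a synchronization arc

    def transfer_cycles(self, comm_time: int) -> int:
        """Memory transfer cost of the arc, which is zero by construction."""
        return (self.read + self.write) * comm_time


def sink_starts_allowed(arc: SyncArc, source_completions: int) -> int:
    """Number of times the sink may start after the given source completions."""
    if arc.consume <= 0:
        raise ValueError("consume amount must be positive")
    length = arc.initial_length + source_completions * arc.produce
    starts = 0
    while length >= arc.threshold and length >= arc.consume:
        length -= arc.consume
        starts += 1
    return starts


strict = SyncArc(threshold=1)                     # sink must wait for the source
primed = SyncArc(threshold=1, initial_length=1)   # sink may run once ahead
```

With `strict`, the sink cannot start until the source has completed once. With `primed`, the initial length lets the sink run one instance ahead, which is the kind of ordering control the threshold and initial length provide.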


B.  HARDWARE  MODEL 


The Large Grain Data Flow system is a multiprocessor system. The major component of the system is
the arithmetic processor. Additional components are the input/output processor, global memory
modules, the scheduler, and the data transfer network. Figure 2.3 is a diagram of the LGDF hardware model.


Figure 2.3. Large Grain Data Flow Hardware Model

1.  Arithmetic  Processor 

The arithmetic processors in this model consist of two units, the execution unit and the control
unit. The nodes complete their tasks on the execution unit. All communications and setup and breakdown
latency are handled by the control unit. Two nodes can be processing on a given processor at a given
time. One node can be doing a task on the execution unit. The other node can be on the control unit, either
preparing to execute when the execution unit is available, or removing itself from the processor and writing
results when execution is finished. The arithmetic processors are assumed to be sophisticated, able to control
many instructions and manipulate large amounts of data on the chip. This means that no data is transferred
during the execution of a node, only before and after execution.

2. Input/Output Processor

The input/output processor acts no differently from the arithmetic processors described above.
However, it only handles the system input and output nodes and system input and output queues. Data is
transferred into and out of the system through this processor. The input/output processor does not factor into
utilization statistics.

3.  Scheduler 

The scheduler is the unit which tracks the entire system state. It also acts as a memory controller,
maintaining a table of all the instruction and data locations and tracking the queue levels to decide when to trigger
nodes. It is assumed that the scheduler has sufficient internal memory to track all of the system information. A
scheduler latency time, expressed in cycles, can be assigned to abstractly represent the time it takes the
scheduler to change the state of its local memory when the amounts on a queue are adjusted.

4. Global Memory Module

The system main memory is modeled as a series of modules. These modules are considered global
since they can be accessed by any processor. A processor must obtain control over the appropriate memory
module to access a queue for either a read or write operation, or to load a node instruction. This information
is supplied to the processor by the scheduler. Multiple module accesses can progress simultaneously;
however, at any time, only a single processor can access a given memory module. The size of the memory is
assumed large enough to meet any requirements. Nodes and queues can be assigned to specific memory
modules by the user or arbitrarily by the scheduler.

5.  Data  Transfer  Network 

The data transfer network is an abstraction in this model. It is assumed that all transactions
between all current processor and memory module pairings can proceed. No transaction will be delayed
because the network is busy. Thus, the data transfer network acts as a full crossbar switching network. There
is a constant data transfer time to transfer one word of data between the processor and memory. This is known
as the word communications time, expressed in cycles per word.
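
The memory and network assumptions can be illustrated with a small Python sketch: the network never delays a transfer, but each memory module serves one processor at a time. The function name, the dictionary of per-module busy times, and the rule that requests are served in the order they are made are all assumptions of this example. The model itself does not specify an arbitration policy.

```python
from typing import Dict, Tuple


def access_module(busy_until: Dict[int, int], module: int,
                  now: int, words: int, comm_time: int) -> Tuple[int, int]:
    """Reserve one memory module for a transfer and return (start, finish) cycles.

    busy_until maps a module number to the cycle at which it becomes free.
    The transfer waits only if that module is busy; other modules and the
    crossbar network never cause a delay.
    """
    start = max(now, busy_until.get(module, 0))
    finish = start + words * comm_time  # constant cost per word transferred
    busy_until[module] = finish
    return start, finish


busy: Dict[int, int] = {}
first = access_module(busy, module=0, now=0, words=10, comm_time=2)
same_module = access_module(busy, module=0, now=5, words=4, comm_time=2)
other_module = access_module(busy, module=1, now=5, words=4, comm_time=2)
```

The second request targets the busy module and is delayed until the first transfer ends, while the third proceeds at once on a different module. The difference between `start` and `now` is the memory-conflict delay that the timing formulas later call `delays`.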


C. OVERALL SYSTEM MODEL

Sections A and B above describe the software and hardware specifics. To define the overall system, the
interaction of the software and the hardware must be considered. This can best be displayed by considering
the system from two perspectives, that of the node and that of the processor.


1. Node Perspective

The node is the primary software element. An LGDF system is designed so that a node, when all
the data is available, can be assigned to any available processor of the type that the node requires. A ready list
is maintained of all the nodes which are ready to execute.

The scheduling unit knows the structure of the entire data flow graph and can track the status of
all nodes and queues. These events are between the node and the scheduler: Check if Data is Available, Check
if Data Space is Available, and Check if Processor is Available. The rest of the events are between the node and
the assigned processor.

a. Check if Data is Available

The scheduling unit checks each queue which has the node as a sink. If all of the queues
which enter the node are above threshold, then the node is ‘input’ ready.

b. Check if Data Space is Available

The scheduling unit checks each queue which has the node as a source. If all the queues have
enough space below capacity to receive the data produced when the node completes, then the node is ‘output’
ready. The node is now assigned to the node ready list.
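
The two readiness checks can be sketched as simple predicates over the queues attached to a node. The Python below is a hedged illustration: queues are represented as plain dictionaries with hypothetical keys, and a queue is assumed to satisfy the input check when its length is at or above the threshold.

```python
from typing import Iterable, Mapping


def input_ready(in_queues: Iterable[Mapping[str, int]]) -> bool:
    """True when every queue entering the node holds at least its threshold."""
    return all(q["length"] >= q["threshold"] for q in in_queues)


def output_ready(out_queues: Iterable[Mapping[str, int]]) -> bool:
    """True when every queue leaving the node has room for one produce amount."""
    return all(q["length"] + q["produce"] <= q["capacity"] for q in out_queues)


def node_ready(in_queues, out_queues) -> bool:
    """A node joins the ready list only when both checks pass."""
    return input_ready(in_queues) and output_ready(out_queues)


ins = [{"length": 4, "threshold": 4}]
full_out = [{"length": 6, "produce": 4, "capacity": 8}]     # 6 + 4 exceeds 8
drained_out = [{"length": 2, "produce": 4, "capacity": 8}]  # 2 + 4 fits
blocked = node_ready(ins, full_out)
released = node_ready(ins, drained_out)
```

In this example the node has all of its input data but stays off the ready list until the downstream node consumes enough data to open space on the output queue.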

c. Check if Processor is Available

The node ready list is a First-Come-First-Serve wait list. The scheduler moves along the list
from head to tail and checks for each node in the list if the proper type of processor is available. If a processor
of the proper type for the node is available, the node is assigned to that processor.
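
One pass of the ready-list scan described above can be sketched as follows. This Python example is illustrative only: the function name, the tuple representation of ready nodes, and the count of idle processors per type are assumptions, and nodes that cannot be placed are assumed to keep their relative order on the list.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def fcfs_dispatch(ready_list: Deque[Tuple[str, str]],
                  free_processors: Dict[str, int]) -> List[str]:
    """Scan the ready list from head to tail and assign nodes to idle processors.

    ready_list holds (node_name, processor_type) pairs in arrival order.
    free_processors maps a processor type to the number of idle processors.
    Both arguments are updated in place; the assigned node names are returned.
    """
    assigned: List[str] = []
    still_waiting: Deque[Tuple[str, str]] = deque()
    while ready_list:
        name, ptype = ready_list.popleft()
        if free_processors.get(ptype, 0) > 0:
            free_processors[ptype] -= 1
            assigned.append(name)
        else:
            still_waiting.append((name, ptype))  # keeps its place in line
    ready_list.extend(still_waiting)
    return assigned


ready = deque([("a", "arith"), ("b", "io"), ("c", "arith"), ("d", "arith")])
free = {"arith": 2, "io": 0}
started = fcfs_dispatch(ready, free)
```

Node `b` waits because no processor of its type is idle, yet it does not block `c` behind it. Node `d` waits because both arithmetic processors are taken by earlier arrivals.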

d. Setup

The node begins preparation for execution as specified in the node setup latency parameter
(in cycles). The node is utilizing the processor control unit.

e. Load Instruction

The node loads the code segment from memory to the control unit. This is specified by the
node instruction length parameter (in words) and the word communication time (cycles per word), along with
any delay in accessing the memory unit where the instructions reside.

f. Read Data / Consume Data

The node proceeds to read the data from the appropriate queues, up to the specified read
amount parameter for the queues. The scheduler simultaneously consumes data from the queues up to the
specified consume amount parameter. The time spent for each queue is specified by the read parameter (in
words) and the word communication time (cycles per word), along with the scheduler latency time (in cycles).
Additionally, delays could result if the memory unit where the data is stored is currently being used by another
processor. This event is not complete until the information for all input queues has been read and/or
consumed.

g. Check for Execution Unit Availability

Once the data queues are read, the node is ready for execution. However, the execution unit
might be in use by another node. Thus, the node may be blocked, waiting on the execution unit. Once the
execution unit becomes available, the node will switch from the control unit to the execution unit.

h. Execute

The node performs execution as specified by the node execution time parameter (in cycles).

i. Check for Control Unit Availability

Once the node has completed execution, it is ready to output the results and remove itself
from the processor. However, the control unit might be in use by another node. Thus, the node may be
blocked, waiting on the control unit. Once the control unit becomes available, the node will switch from the
execution unit to the control unit.

j. Write Data / Produce Data

The node proceeds to write the data to the appropriate queues, up to the specified write
amount parameter for the queues. The scheduler simultaneously produces data to the queues up to the
specified produce amount parameter. The time spent for each queue is specified by the write parameter (in
words) and the word communication time (cycles per word), along with the scheduler latency time (in cycles).
Additionally, delays could result if the memory unit where the data is stored is currently being used by another
processor. This event is not completed until the information for all output queues has been written and/or
produced.

k. Breakdown

The node removes itself from the processor as specified by the node breakdown latency
parameter (in cycles). Upon completion of breakdown, the node is disassociated from the processor. This
completes one entire iteration for a node.

l. Summary

Table 2.1 provides a summary of the above listed events and the proper calculation of their
processing times. The term ‘delay’ refers to stalls caused by memory conflicts, the inability to access a queue
or instruction in memory due to that memory module being used by another node.


Table 2.1: PARAMETER DEFINITIONS

Code            Definition / Time

ExecutionTime   Node Execution Time Parameter (in cycles)
SetupTime       Node Setup Latency Time Parameter (in cycles)
BreakdownTime   Node Breakdown Latency Time Parameter (in cycles)
InstLen         Node Instruction Length Parameter (in words)
WriteAmt        Queue Write Amount Parameter (in words)
ReadAmt         Queue Read Amount Parameter (in words)
CommTime        Word Communications Time (in cycles per word)
LatencyTime     Scheduler Latency Time (in cycles)
LoadTime        CommTime * InstLen + delays
ReadTime        [ ( LatencyTime + CommTime * ReadAmt ) + delays ]
                for all queues with the node as a sink
WriteTime       [ ( LatencyTime + CommTime * WriteAmt ) + delays ]
                for all queues with the node as a source

m. Event Reduction

Most of the events result in a time mark for the next event. Therefore, several of the events
can be combined to simplify the model. Many of these events, although different, contribute to an overall time
which lends itself to easier analysis of the results. The resulting event reductions are defined as phases for
easy differentiation from the previously described events.

(1) Input Phase - This event represents the total time a node spends on the control unit for a
given iteration, from the time it is assigned to the time the execution unit becomes available. It includes these
events: Setup, Load Instruction, Read Data / Consume Data, and Check for Execution Unit Availability.

(2) Execution Phase - This event represents the total time a node spends on the execution
unit for a given iteration, from the time the execution unit becomes available to the time the control unit
becomes available. It includes these events: Execute, and Check for Control Unit Availability.

(3) Output Phase - This event represents the total time a node spends on the control unit for
a given iteration, from the time the control unit becomes available to the time breakdown is completed. It
includes these events: Write Data / Produce Data, and Breakdown.

Table  22  is  a  summary  of  die  time  calculations  for  these  phases.  The  term  blockage  refers  to 
stalls  caused  by  tihe  inability  of  a  node  to  switch  to  die  other  processing  dement  (control  unit  to  execution 
nnit  or  exception  mrit  to  control  unit)  until  the  node  on  the  other  processing  element  completes  its  operation. 
It  is  to  be  noted  that  die  contention  for  memory  mocfales  during  die  input  and  output  phases  is  implicit  in 
‘Readlime’  and  ‘WriteTime’  respectively. 

Table 2.2: PHASE TIME DEFINITIONS

Code          Definition / Time
InputTime     SetupTime + LoadTime + ReadTime + blockage
ExecuteTime   ExecutionTime + blockage
OutputTime    WriteTime + BreakdownTime + blockage
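
The phase definitions can be sketched in a few lines of Python. This is a minimal illustration, not the simulator used in this study: the names `NodeParams` and `phase_times` are hypothetical, and `delays` and `blockage`, which are run-time quantities in the model, are assumed here to be constants supplied by the caller.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class NodeParams:
    setup_time: int             # SetupTime, in cycles
    breakdown_time: int         # BreakdownTime, in cycles
    execution_time: int         # ExecutionTime, in cycles
    inst_len: int               # InstLen, in words
    read_amts: Sequence[int]    # ReadAmt of every queue with the node as a sink
    write_amts: Sequence[int]   # WriteAmt of every queue with the node as a source


def phase_times(node, comm_time, latency_time, delays=0, blockage=0):
    """Return (InputTime, ExecuteTime, OutputTime) in cycles.

    Assumption: `delays` is added once per transfer and `blockage` once per
    phase, both as fixed numbers rather than values produced by a simulation.
    """
    load_time = comm_time * node.inst_len + delays
    read_time = sum(latency_time + comm_time * amt + delays
                    for amt in node.read_amts)
    write_time = sum(latency_time + comm_time * amt + delays
                     for amt in node.write_amts)
    input_time = node.setup_time + load_time + read_time + blockage
    execute_time = node.execution_time + blockage
    output_time = write_time + node.breakdown_time + blockage
    return input_time, execute_time, output_time


# Hypothetical node: one input queue and one output queue of 1000 words each.
example = NodeParams(setup_time=0, breakdown_time=0, execution_time=10000,
                     inst_len=0, read_amts=[1000], write_amts=[1000])
print(phase_times(example, comm_time=2, latency_time=0))
```

With no delays or blockage, the input and output phases of this hypothetical node reduce to pure transfer time (CommTime multiplied by the queue amount).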

n. Representation Comparison


Figure 2.4 is a graphical representation of these times as associated with a processor. Two
diagrams are given. The first diagram is the detailed model. The second diagram is the reduced model. As far
as node scheduling techniques are concerned, the reduced model will be used.


[Figure 2.4 shows two timelines along a TIME axis, the DETAILED MODEL and the REDUCED MODEL, each with a CONTROL UNIT column and an EXECUTION UNIT column.]

Figure 2.4. Time on Processor Representation

2.  Processor  Perspective 

The processor can be best described as a finite state machine. Two finite state diagrams are given.
These state diagrams represent the same system, but from different points of view. Figure 2.5 is the internal
view state diagram. This is the state of the processor and nodes as it appears on the processor. Figure 2.6 is
the external view state diagram. This is the state of the processor as it appears to the outside world.

Table 2.3 lists the codes used to define the states. Note that one control unit code and one
execution unit code are required to define a complete state.

Table 2.3: STATE DIAGRAM CODES

State Code   State Description
ExeFree      Execution Unit is Free
ExeBusy      Execution Unit is Busy (node is in Execution Phase)
ConFree      Control Unit is Free (Processor Available for Node Assignment)
ConBusy      Control Unit is Busy (a node is performing either Input or Output)
ConInput     Control Unit is Busy with a node performing Input
ConOutput    Control Unit is Busy with a node performing Output

Several of the transitions require further explanation. Recall that two different nodes can be
operating on a processor at any given time. One node is performing execution on the execution unit, and the
other node is performing either input or output on the control unit.

(1)  In  the  case  where  one  node  is  executing  and  another  is  performing  input,  then  neither 
node  can  go  to  the  next  state  until  both  actions  are  completed,  as  the  nodes  must  swap  the  units  they  are 
currently  occupying,  with  the  node  which  completed  execution  moving  to  the  control  unit  to  perform  output 
and  the  node  which  completed  input  moving  to  the  execution  unit  to  perform  execution.  This  transition  is 
defined  as  ‘Execution  and  Input  Completed’. 

(2) In the case where one node is executing and another is doing output, there are two
possible occurrences. If the node performing output completes first, then it simply is removed from the
processor. However, if the node executing completes first, it stalls while waiting for the other node to
complete output. When this second node completes output, it will disassociate itself from the processor and
the node which completed execution will obtain use of the control unit. This transition is defined as
‘Execution Completes then Output Completes’.
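
The timing consequences of these two transitions can be sketched as follows. The function names are hypothetical and the sketch only computes when the hand-over happens and how long each node stalls (the blockage term of Table 2.2); it is not the state machine itself.

```python
def swap_after_execute_and_input(exec_done, input_done):
    """Case (1): 'Execution and Input Completed'.

    The two nodes swap units only when both actions are complete.
    Returns (swap_time, stall of the executing node, stall of the input node).
    """
    swap_time = max(exec_done, input_done)
    return swap_time, swap_time - exec_done, swap_time - input_done


def output_start_after_execute(exec_done, other_output_done):
    """Case (2): one node executes while another performs output.

    The executing node obtains the control unit once the other node's output
    has completed. Returns (time its own output can start, stall time).
    """
    output_start = max(exec_done, other_output_done)
    return output_start, output_start - exec_done


# Execution ends at cycle 100, the other node's input ends at cycle 130:
print(swap_after_execute_and_input(100, 130))
# Execution ends at cycle 100, the other node's output ends at cycle 120:
print(output_start_after_execute(100, 120))
```

In case (2), a zero stall corresponds to the first occurrence described above, where the node performing output completes first and simply leaves the processor.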

Figure 2.5. Processor Internal View State Diagram


III. SCHEDULING TECHNIQUES


A key factor in the Large Grain Data Flow (LGDF) system is the scheduling of the nodes in the data
flow graph to the processors. This chapter will discuss important scheduling issues inherent to the LGDF
system, scheduling techniques, and improvements.

A. TERMS

Several important terms are used in the analysis of the scheduling techniques.

1.  Throughput 

Throughput is the total number of instances completed in a given time interval. Throughput is
uniform if the time interval between the completion of consecutive graph instances is constant.

2. Response Time

The response time is the time it takes to complete one iteration of a graph. This is the actual time
from the beginning of graph processing to the end of graph processing for a given graph iteration. The
response time is uniform if each graph instance completes in a constant time.

B. COMMUNICATION / COMPUTATION OVERLAP

An important aspect of this LGDF model is the dual unit processors. Each processor has a control unit
and an execution unit. Different nodes can be operating simultaneously on different units of the same
processor. All communications and node control functions take place on the control unit. It is desirable to
have these control and communication functions done concurrently with the execution of another node. This
is known as communication / computation overlap. Ideally, the communications and control functions would
completely overlap with the execution.

To fully appreciate the techniques, the concept of communication / computation overlap must be
introduced. This can best be shown graphically. Previously, Figure 2.4 displayed one node upon a processor.
However, in this LGDF model, two nodes will normally be on a processor simultaneously. There are many
possible situations which can occur.

Many of these situations are described graphically in detail. Note that these figures display the state of
the processor in the middle of activities. The node designated ‘node 0’ has been executing for some time;
‘node 1’ has just been assigned to the processor.

In the following descriptions, the term ‘communication’ represents all communications and control
functions and latency times. The term ‘computation’ represents the actual processor execution. These two
terms are selected as they are prevalent in current literature.

1.  Perfect  Communication  /  Computation  Overlap 

Figure 3.1 displays the perfect overlap condition. This condition is rather unrealistic as it is highly
unlikely that the communications would perfectly match the computation. However, this is the theoretical
case.


[Figure 3.1 timeline along the TIME axis. CONTROL UNIT: node 1 input, node 0 output, node 2 input, node 1 output. EXECUTION UNIT: node 0 execute, node 1 execute, node 2 execute. Markers show where node 1 and node 2 are assigned.]

Figure  3.1.  Perfect  Communication  /  Computation  Overlap 

2. Good Communication / Computation Overlap

Figure 3.2 displays good overlap conditions (assuming that perfect overlap will not occur). In this
case, communication is completely overlapped with computation. This situation will tend to occur when the
memory access speed is fast compared to processor speed, or the instructions represented by the nodes require
large amounts of processing compared to the amount of data transfer.

Several conditions are displayed in Figure 3.2. The heavily shaded portion represents a blocked
control unit. Node 2 has completed its input, but it cannot begin execution because node 1 has not completed
execution. The lightly shaded portion represents an idle control unit. In this case, no node is ready to begin
processing. Neither of these conditions is bad since the execution unit is operating at its full capability.

3. Poor Communication / Computation Overlap

Figure 3.3 displays poor overlap conditions. In this case, communication is not completely
overlapped with computation. This situation will tend to occur when the memory access speed is slow
compared to processor speed, or the instructions represented by the nodes require small amounts of
processing compared to the amount of data transfer.

Figure 3.3. Poor Communication / Computation Overlap

Several conditions are displayed in Figure 3.3. The heavily shaded portion represents a blocked
execution unit. Node 1 has completed execution, but cannot commence output until node 2 completes input.
The lightly shaded portion represents an idle execution unit. In this case, the control unit is busy, forcing the
execution unit to be idle. As output has priority over input in the model, the beginning of execution is further
delayed until the next ready node completes input. These conditions represent bad performance because no
useful computation is being performed.

4. Realistic Communication / Computation Overlap

In actual processing, it is likely that ‘good’ overlap will occur at times and ‘poor’ overlap will
occur at other times. The various scheduling techniques to be discussed later in this chapter will attempt to
force the system to have more ‘good’ overlap node to processor assignments than ‘poor’ overlap node to
processor assignments. This is not necessarily an easy undertaking as, in general, all nodes have wide ranges
of execution times and required volumes of communication.
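
The three overlap conditions can be illustrated with a simplified steady-state sketch. It assumes, following the sequence shown in Figure 3.1, that while node k executes, the control unit performs the output of node k-1 followed by the input of node k+1. The function names and the sample timings are hypothetical, and contention and scheduler latency are ignored.

```python
def classify_overlap(comm, comp):
    """Classify overlap for one execution interval (times in cycles)."""
    if comm == comp:
        return "perfect"
    return "good" if comm < comp else "poor"


def overlap_report(nodes):
    """nodes: (input, execute, output) times of nodes run back to back.

    For every interior node k, compare the communication that must fit in its
    execution interval (output of k-1 plus input of k+1) with its execution
    time. Returns (classification, execution unit idle cycles) per node.
    """
    report = []
    for k in range(1, len(nodes) - 1):
        comm = nodes[k - 1][2] + nodes[k + 1][0]
        comp = nodes[k][1]
        report.append((classify_overlap(comm, comp), max(0, comm - comp)))
    return report


# Hypothetical timings: the first interior node overlaps perfectly, the second well.
print(overlap_report([(2, 10, 3), (4, 10, 2), (7, 10, 1), (1, 10, 9)]))
# Communication heavy case: the execution unit sits idle for 5 cycles.
print(overlap_report([(1, 5, 4), (1, 5, 1), (6, 5, 1)]))
```

A scheduling technique that reorders nodes on a processor changes which outputs and inputs share an execution interval, which is what moves an assignment from ‘poor’ to ‘good’.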

5. Revised Finite State Machine

Figure 2.5 provided a state diagram to describe the processor. With the possible overlap conditions
defined in the above diagrams, an expanded state diagram can be provided to more accurately describe the
model, provided in Figure 3.4. Table 3.1 provides the processing unit state codes. Once again, an execution
unit code and a control unit code are necessary to define the system state.


Table 3.1: STATE DIAGRAM CODES

State Code   State Description
ExeIdle      Execution Unit is Idle
ExeCalc      Execution Unit is Calculating
ExeBlock     Execution Unit is Blocked with a node waiting for the Control Unit
ConIdle      Control Unit is Idle (Processor Available for Node Assignment)
ConInput     Control Unit is Busy with a node performing Input
ConOutput    Control Unit is Busy with a node performing Output
ConBlock     Control Unit is Blocked with a node waiting for the Execution Unit

In node to processor scheduling, it is important to minimize the execution unit idle states (ExeIdle)
and execution unit blocked states (ExeBlock). Ignoring the end points of operation (where there must be some
execution unit idle time), the goal is to cycle continuously through the following states (this cycle is
highlighted on the state diagram):

-> ( ConIdle / ExeCalc ) ->

-> ( ConInput / ExeCalc ) ->

-> ( ConBlock / ExeCalc ) ->

-> ( ConOutput / ExeCalc ) ->

-> recycle
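
A small sketch can check whether an observed sequence of processor states stays on this cycle. The enumerations mirror the state codes of Table 3.1; the names `Con`, `Exe`, `IDEAL_CYCLE` and `follows_ideal_cycle` are hypothetical and the check is an illustration only, not part of the simulator.

```python
from enum import Enum


class Con(Enum):
    IDLE = "ConIdle"
    INPUT = "ConInput"
    OUTPUT = "ConOutput"
    BLOCK = "ConBlock"


class Exe(Enum):
    IDLE = "ExeIdle"
    CALC = "ExeCalc"
    BLOCK = "ExeBlock"


# The desired cycle: the execution unit is calculating in every state.
IDEAL_CYCLE = [
    (Con.IDLE, Exe.CALC),
    (Con.INPUT, Exe.CALC),
    (Con.BLOCK, Exe.CALC),
    (Con.OUTPUT, Exe.CALC),
]


def follows_ideal_cycle(trace):
    """True if the trace steps through IDEAL_CYCLE in order, starting anywhere."""
    if not trace:
        return True
    if trace[0] not in IDEAL_CYCLE:
        return False
    start = IDEAL_CYCLE.index(trace[0])
    size = len(IDEAL_CYCLE)
    return all(state == IDEAL_CYCLE[(start + i) % size]
               for i, state in enumerate(trace))


print(follows_ideal_cycle(IDEAL_CYCLE * 2))
print(follows_ideal_cycle([(Con.INPUT, Exe.CALC), (Con.OUTPUT, Exe.IDLE)]))
```

Any trace that visits an ExeIdle or ExeBlock state fails the check, which matches the stated goal of keeping the execution unit calculating.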

Figure 3.4. Expanded Processor State Diagram

C. CONTENTION

Contention refers to the inability for a communications operation to occur between a processor and a
memory module due to the memory module being utilized by another processor. This results in a delay of the
node on the processor requesting use of the memory module.

1. Queue Contention

A queue can only be accessed by one node at a time. Therefore, if the source node wants to write
data and the sink node wants to read data, one would be delayed until the other completes its current operation.

2.  Memory  Contention 

Memory contention is generally more broad than queue contention, since queue contention
represents two nodes trying to access the same set of locations in the memory module. With memory
contention, one processor is accessing a node or queue in a specific memory unit. This could be either reading
from a queue, writing to a different queue, or loading a node program. While this operation is taking place,
no other queue or node program can be accessed by another processor from the same memory module.
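
The effect of memory contention can be sketched by serializing accesses to each memory module. The function name is hypothetical, and the arbitration policy (requests served in order of request time, ties broken by list order) is an assumption made for illustration; the text does not specify one.

```python
def serialize_accesses(requests):
    """requests: list of (request_time, module, duration) tuples.

    Only one access per memory module can be in progress at a time.
    Returns, for each request in its original position, (start_time, delay).
    """
    free_at = {}                      # module -> time it becomes free
    results = [None] * len(requests)
    # sorted() is stable, so simultaneous requests keep their list order.
    for i in sorted(range(len(requests)), key=lambda i: requests[i][0]):
        request_time, module, duration = requests[i]
        start = max(request_time, free_at.get(module, 0))
        free_at[module] = start + duration
        results[i] = (start, start - request_time)
    return results


# Two processors want module "M0" at overlapping times; a third uses "M1".
print(serialize_accesses([(0, "M0", 100), (10, "M0", 50), (10, "M1", 50)]))
```

The second request is delayed until the first releases the module, while the request to the other module proceeds immediately. This delay is the `delays` term that appears in LoadTime, ReadTime and WriteTime.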

D.  FIRST-COME-FIRST-SERVE  SCHEDULING  TECHNIQUE 

In First-Come-First-Serve (FCFS) scheduling, nodes
are assigned to processors in the order in which they are made ready. There is no forethought or attempt at
optimization.
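
A minimal sketch of FCFS dispatching is given below. It is a simplification under stated assumptions: the ready times are supplied directly instead of being derived from queue thresholds, and each node occupies a whole processor for a single `busy_time`, so the dual unit overlap is not modeled. The function name is hypothetical.

```python
import heapq


def fcfs_dispatch(ready_nodes, num_processors):
    """ready_nodes: (ready_time, name, busy_time) tuples.

    Nodes are taken in the order they become ready and placed on the next
    available processor. Returns {name: (processor, start_time)}.
    """
    # Heap of (time the processor becomes free, processor id).
    free = [(0, p) for p in range(num_processors)]
    heapq.heapify(free)
    schedule = {}
    for ready_time, name, busy_time in sorted(ready_nodes, key=lambda n: n[0]):
        free_time, proc = heapq.heappop(free)
        start = max(ready_time, free_time)
        schedule[name] = (proc, start)
        heapq.heappush(free, (start + busy_time, proc))
    return schedule


print(fcfs_dispatch([(0, "A", 5), (0, "B", 3), (1, "C", 2), (2, "D", 4)], 2))
```

Nothing in the dispatcher looks at communication times or memory module assignments, which is the source of the disadvantages discussed in this section.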


1.  Advantages 


a. Simplicity

Since there is no special order to the assignment of nodes, the amount of overhead (software
and additional hardware) required for the assignment is negligible.

b. Processor Utilization

Processors will be in use constantly. As long as nodes are in the ready list, they will be
assigned to available processors.

c. Minimal Queue Contention

As a function of the FCFS algorithm, the queue contention is minimized. This is due to the
fact that a node cannot begin input until all queues into the node are ready. Therefore, the source node will
write data to a queue. Then, the queue would be ready to be read by the sink node.

d. Fault Tolerance

With an FCFS implementation of scheduling, the system is fault tolerant. Since nodes will
not be assigned to processors until all data is ready, no deadlocks will occur.

2. Disadvantages

a. Communication / Computation Overlap

There is no guarantee of good communication / computation overlap with FCFS, since nodes
are placed on the next available processor, regardless of whether the communication times and computation
times can be made to overlap.

b. Unpredictable Response Time and Throughput

With the communication / computation overlap that is likely to change from one graph
iteration to the next, it is difficult to predict the graph response time and throughput.

c. Memory Contention

Since  nodes  are  assigned  to  processors  when  they  are  ready,  there  is  no  way  to  predict  which 
memory  modules  would  be  required  at  any  time. 

It can be expected that if communication time is very small compared to computation time for
most nodes in the graph, then FCFS can perform well since the effects of the disadvantages will be minimized.
Conversely, if the communication time is large compared to computation time, then the disadvantages will
be accentuated. We expect the latter to be the case precisely because the graphs are LGDF.

E. REVOLVING CYLINDER SCHEDULING TECHNIQUE

The Revolving Cylinder (RC) technique is designed specifically for Large Grain Data Flow
systems. It is assumed that the application requires the specified data flow graph to be executed continuously.

The premise is that at any given time the nodes of one graph equivalent must be processed. This means
that not all of the nodes will be working on the same data set, but one instance of each node is ready to work
on a data set. With the RC technique, this one graph equivalent will be mapped to the available processors.
This mapping is known as the cylinder. The term revolving cylinder refers to the fact that additional cylinders,
exactly the same as the first, can be placed one after another. Essentially, the execution resembles a rotating
cylinder.

There are four variations of the revolving cylinder technique that will be described. The first variation
to be presented is Start After Finish (SAF). The second variation, Start After Start (SAS), uses the
synchronization arcs in a different manner. In both SAF and SAS, there is no requirement that nodes always
be scheduled to the same processor. However, SAF and SAS can be further modified by binding the nodes to
specific processors. These variations are termed SAFb and SASb respectively.

1. Index Assignment and Synchronization Arcs

In a given slice, many of the nodes will not be working on the same set of data; therefore, the nodes
are assigned an index to reference the data set that node is currently operating on. Once the indices are
determined, synchronization arcs are generated. These synchronization arcs are control signals which enforce
the cylinder structure.

Figure 3.5 is a simple data flow graph which is scheduled on two processors. Note that for the
demonstration of the RC technique, the only node parameter is the execution time. Also note that the input
and output nodes do not get mapped to the cylinder. The node identifier is the letter and the node execution
time is the number inside the node. In the processor mapping, the index is the number in parentheses.

Two cylinders are mapped. Indices are assigned to the first cylinder as follows. Ignore the
synchronization arcs in determining data dependencies. The first node mapped is ‘A’. Therefore, it is given
an index of ‘0’. Nodes ‘B’ and ‘D’ depend on the results of ‘A’. Node ‘B’ appears after node ‘A’ on the same
processor. Therefore, it can work on the same data set as ‘A’, hence an index of ‘0’. However, node ‘D’ begins
processing at the same time as node ‘A’. Since it depends on the results of ‘A’, node ‘D’ must be operating
on a previous data set, hence an index of ‘-1’. Node ‘C’ depends only on node ‘B’ for data. Although it is
scheduled to a different processor, node ‘C’ does not start until node ‘B’ completes; therefore, it can still
operate on the same data set as ‘B’, thus an index of ‘0’. Node ‘E’ depends on data from both nodes ‘C’ and
‘D’. It is assigned an index of ‘-1’ for two reasons. First, node ‘D’ has an index of ‘-1’. Node ‘E’ starts after
‘D’, so it can have the same index, ‘-1’. Second, node ‘C’ is processing at the same time as node ‘E’.
Therefore, node ‘E’ must be operating on a previous set of data. Therefore, since ‘C’ has an index of ‘0’, then
‘E’ must have an index of ‘-1’. The second cylinder is mapped in the same manner as the first, but with the
indices increased by one.
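
The index assignment rule described above can be sketched as follows. The data dependencies are those stated in the text; the start and finish times are hypothetical values chosen only to be consistent with the description (a five cycle cylinder, ‘D’ starting with ‘A’, ‘C’ starting when ‘B’ completes, and ‘C’ running alongside ‘E’). The function name is also hypothetical.

```python
def assign_indices(schedule, preds, order):
    """Assign a data set index to every node of one cylinder.

    schedule: {node: (start, finish)} within the cylinder.
    preds:    {node: [data predecessors]} (synchronization arcs ignored).
    order:    nodes listed so that predecessors come before successors.
    """
    index = {}
    for node in order:
        start = schedule[node][0]
        candidates = []
        for p in preds.get(node, []):
            finish_p = schedule[p][1]
            # A predecessor that finishes before this node starts can feed it
            # in the same cylinder; otherwise the node must be working on the
            # data set produced one cylinder earlier.
            candidates.append(index[p] if finish_p <= start else index[p] - 1)
        # A node with no mapped predecessor is given index 0.
        index[node] = min(candidates) if candidates else 0
    return index


# Assumed execution times: A=1, B=2, E=2 on processor 1; D=3, C=2 on processor 2.
schedule = {"A": (0, 1), "B": (1, 3), "E": (3, 5), "D": (0, 3), "C": (3, 5)}
preds = {"B": ["A"], "D": ["A"], "C": ["B"], "E": ["C", "D"]}
order = ["A", "B", "D", "C", "E"]
print(assign_indices(schedule, preds, order))
# {'A': 0, 'B': 0, 'D': -1, 'C': 0, 'E': -1}
```

Taking the minimum over the predecessors reproduces the two-part argument given for node ‘E’: both ‘C’ and ‘D’ lead to the same index of -1.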


LEGEND (additions to Figure 2.1): Synchronization Arc; Token

Processor assignment (TIME axis from 0 to 9; index in parentheses):

Processor 1:   A(0)    B(0)   E(-1)   A(1)   B(1)   E(0)
Processor 2:   D(-1)   C(0)   D(0)    C(1)

Figure 3.5. Data Flow Graph and Processor Assignment


This is the Start After Finish (SAF) version of the revolving cylinder technique for generating the
synchronization arcs. The sink node at the head of the synchronization arc will be allowed to start after the
source node at the tail of the arc completes. The synchronization arcs are generated as follows. Nodes ‘A’,
‘B’ and ‘C’ operate in consecutive order on the same instance. Therefore, they maintain data dependence and
no synchronization arcs are necessary between them. Likewise, nodes ‘D’ and ‘E’ maintain such a data
dependence. However, in this mapping, node ‘C’ executes on the same processor as node ‘D’. To set up the
cylinder, node ‘C’ must wait for one instance of node ‘D’ to execute. Therefore, a synchronization arc is
generated between ‘C’ and ‘D’. Looking at the whole cylinder, node ‘A’ cannot start executing until node ‘E’
of the previous instance completes. Therefore, a synchronization arc exists between ‘E’ and ‘A’.

Tokens on synchronization arcs represent a counter. The tokens listed represent the initial length
parameter of the synchronization arc as defined in the previous chapter in the section on queues. It is obvious
that these tokens are needed. The synchronization arcs define the need for node ‘E’ to complete before node
‘A’ begins. However, no instance of ‘E’ can ever occur until one instance of node ‘A’ executes. Therefore,
the initial token will allow the process to start. Likewise for the token on the synchronization arc between
nodes ‘D’ and ‘C’. After multiple instances of the graph have executed, the cylinder should look as it is with
all nodes at the proper index.
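
The need for the initial tokens can be illustrated with a simplified token-firing sketch. It treats every data arc and synchronization arc of the example as a counter and lets a node fire only when all of its incoming arcs hold a token. Timing, processors and queue thresholds are deliberately left out, and the function name and arc ordering are hypothetical.

```python
def run_cylinders(arcs, tokens, firing_order, rounds):
    """Count how often each node fires.

    arcs:   list of (source, sink) pairs.
    tokens: initial token count of each arc, in the same order as arcs.
    Each round tries every node of firing_order once; a node fires only if
    every incoming arc holds at least one token.
    """
    tokens = list(tokens)
    fired = {node: 0 for node in firing_order}
    for _ in range(rounds):
        for node in firing_order:
            incoming = [i for i, (src, dst) in enumerate(arcs) if dst == node]
            if all(tokens[i] > 0 for i in incoming):
                for i in incoming:
                    tokens[i] -= 1          # consume one token per input arc
                for i, (src, dst) in enumerate(arcs):
                    if src == node:
                        tokens[i] += 1      # produce one token per output arc
                fired[node] += 1
    return fired


data_arcs = [("A", "B"), ("A", "D"), ("B", "C"), ("C", "E"), ("D", "E")]
sync_arcs = [("E", "A"), ("D", "C")]
arcs = data_arcs + sync_arcs
order = ["A", "B", "D", "C", "E"]

with_tokens = run_cylinders(arcs, [0, 0, 0, 0, 0, 1, 1], order, rounds=3)
without_tokens = run_cylinders(arcs, [0, 0, 0, 0, 0, 0, 1], order, rounds=3)
print(with_tokens)      # every node fires in every round
print(without_tokens)   # no node ever fires: 'A' waits on 'E' and 'E' on 'A'
```

In this simplified model, removing the initial token from the arc between ‘E’ and ‘A’ produces exactly the circular wait described above.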

Showing two cylinders back to back illustrates some important concepts of the RC algorithm. First,
it takes a number of cylinders to complete a graph iteration. This quantity is equal to the range of different
indices in the cylinder. The required time is equal to the number of cylinders multiplied by the time to
complete one cylinder. In this example, two cylinders are required. Note that the range of indices is two (from
0 to 1). Therefore, the time to complete one graph instance is ten cycles (two cylinders multiplied by five
cycles to complete a cylinder). Note that this is longer than the minimum possible time to complete the graph
on two processors, which is seven cycles (based on longest path) in this example. However, it is guaranteed
that it will take ten cycles to complete each and every instance. It is also guaranteed that one instance will
complete during each cylinder. In this example, one iteration completes every five cycles. Therefore, the
revolving cylinder technique results in uniform throughput and uniform response time.
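
The arithmetic of this example can be restated as a short sketch. The function name is hypothetical; the inputs are the node indices of one cylinder and the cylinder length taken from the example.

```python
def rc_timing(indices, cylinder_cycles):
    """Return (response_time, completion_interval) in cycles.

    The number of cylinders needed for one graph instance equals the range of
    indices found in a cylinder; one instance completes per cylinder.
    """
    cylinders = max(indices) - min(indices) + 1
    return cylinders * cylinder_cycles, cylinder_cycles


# Indices of A, B, D, C, E in the first cylinder, five cycles per cylinder.
print(rc_timing([0, 0, -1, 0, -1], 5))   # (10, 5)
```

The response time of ten cycles and the completion interval of five cycles are the same for every instance, which is the sense in which both quantities are uniform.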

The above example is rather simplistic and not representative of the Large Grain Data Flow model
studied. In the LGDF model, the nodes are not operating in distinct blocks. One node actually begins
preparing to execute on a processor before the previous operating node is finished. Therefore, determining
the actual indices and arcs is not a simple matter on even a moderately complex data flow graph. However,
the start after finish synchronization arc generation and revolving cylinder assignment technique is still valid.

2. Advantages

a. Predictable Performance

Since uniform cylinders are assigned to the processor set, the system will have more
predictable throughput and average response time.

b. Maximize Communication / Computation Overlap

The nodes in the cylinder can be placed to achieve maximum overlap of communication time
with computation time. If the communication cost of the system is low, there will be little gain to the
revolving cylinder technique.

c. Reduce Memory Contention

Once the cylinder is mapped, it can be determined which nodes and queues must be accessed
at the same time. Therefore, nodes and queues can be mapped to different memory modules to ensure that
they are not active at the same time, reducing memory contention. This could be a difficult task as queues are
operated on by different nodes at different times. However, any reduction of memory contention will help.
This is impossible with FCFS as it is never known which operations will proceed at any given time.

3.  Disadvantages 

a.  Increased  Overhead 

Overhead is significantly increased with the requirement to generate and track the
synchronization arcs. Also, it is important to generate proper tokens on the synchronization arcs to assure that
deadlocks will not occur due to dependencies which cannot be met.

b. No Overlap Between Cylinder Slices

In this LGDF model, all nodes have some input and some output time. However, with the
start after finish technique, the first node in the next cylinder cannot begin processing until the last node in
the current cylinder completes. Thus, there is no possible communication / computation overlap between
cylinder instances.

c. Unbalanced Loads

A related issue to the non-overlap between cylinder instances is the issue of unbalanced
loads. An ideal cylinder would have the processors completely load balanced. That is, all processors would
complete processing at the same instant. However, this is usually not the case. The next cylinder cannot begin
processing until the last node of the current cylinder completes processing. Therefore, if the last node on one
processor completes long after the nodes on the other processors, the additional processors would remain idle
for extended periods and the throughput reduced.

d. Queue Contention

Queue contention can be minimized through proper mapping. However, it is now a factor to
be taken into consideration.

4. Alternate Revolving Cylinder Scheduling

An alternate version of the revolving cylinder technique, Start After Start (SAS), generates the
synchronization arcs based on when a source node begins, rather than after it ends. This eliminates the
lack of communication / computation overlap between consecutive cylinder mappings.


IV.  RESULTS  AND  ANALYSIS 


This chapter is an analysis of the initial results for the use of the Revolving Cylinder algorithm. The
programs used to generate the results are described fully in [Ref. 12]. Figure 4.1 is a diagram of the
relationship of the programs used to generate the results.


[Figure 4.1 diagram; labels recovered from the figure: machine configuration, communication costs, input data rate, SIMULATION RESULTS.]

Figure 4.1. Program Usage to Produce Results


A.  INITIAL  TRIALS  ON  TEST  GRAPH 

The initial tests were performed on a simple data flow graph to generate baseline results for the
Revolving Cylinder algorithm. This simple graph consisted of one input node and output node (execution
time = 0), and 15 uniform instruction nodes (execution time = 10000). The nodes had no setup or breakdown
latency, and an instruction size of zero. Therefore, the only communication is due to the transfer of data
between processors and memory. The produce amount, consume amount, write amount, read amount, and
threshold amount were all equal for a given queue. However, this number was different for the queues in the
system (either 1000, 2000, or 4000 words). The queue capacity is eight times this amount. Several mappings
of this graph were used at various communication costs over three, four, and five processors. Figure 4.2 shows
the test graph.


The mappings of the nodes to processors for this graph were determined manually, attempting to
maximize the communication / computation overlap. It is noted that in all the mappings for three processors,
the processors were uniformly load balanced, each processor having exactly the same mapping (as far as
computation and communication times) as the other two processors. The mappings for five processors were
fairly well load balanced with exactly three nodes on each processor. However, the amount of communication
overlap on each processor varied. The mappings on four processors were more difficult to determine as the
nodes do not map evenly to processors.

All mappings were tested at four different communication costs: one, two, three, or four cycles to
transfer one word of data between a processor and memory. The scheduler latency was set at zero. For this
graph, the yielded communication / computation ratios are 0.4, 0.8, 1.2, and 1.6 respectively. The simulation
was set to compute the maximum throughput. Along with a First-Come-First-Serve (FCFS) test for the graph,
each mapping was tested using four different variations of the Revolving Cylinder (RC) scheduling technique
as described in the previous section: Start After Finish (SAF), Start After Start (SAS), Start After Finish with
nodes bound to processors (SAFb), and Start After Start with nodes bound to processors (SASb).

In these tests, the number of memory modules was equal to the number of arithmetic processors in the
system. All nodes mapped to a given processor were assigned to one memory module. The queues were
assigned to the memory module to which their sink node is assigned. For FCFS tests, the same memory
assignments were used as for the RC analysis to allow for direct comparison.

One important note must be made about the charts which follow. Although there are several mappings
for each of the scheduling variations, only the result of the best mapping is shown. At different
communication costs, the best mapping would often be different. Even at the same communication cost,
various scheduling techniques could be better on different mappings.

The  first  test  results  were  for  a  contention  free  situation.  This  is  an  ideal  result  where  a  node  or  queue 
is  always  able  to  access  the  memory  unit  where  its  required  data  is  located.  Figure  4.3  shows  the  results  of 


Figure  4.3.  Test  Graph  on  3  Processors  (Contention  Free) 


In this test, it is apparent that with no contention, FCFS provides the best throughput. SAS can come
close to FCFS, but SAF is lacking, due to the inability to overlap consecutive cylinders. Processor binding
yields no significant difference.

The next test is with memory contention a factor. Figures 4.4, 4.5, and 4.6 provide the results for three,
four, and five processors respectively.


(Chart axis: communication / computation ratio)


Figure  4.4.  Test  Graph  on  3  Processors  (with  Contention) 


Figure  4.5.  Test  Graph  on  4  Processors  (with  Contention) 


Figure  4.6.  Test  Graph  on  5  Processors  (with  Contention) 

Several points are apparent. At low communication costs, FCFS will provide high throughput.
However, as the communication cost increases, FCFS throughput sharply decreases. The RC techniques show
that increased throughput over FCFS is possible, especially as the communication cost increases. This is due
to being able to map the nodes such that contention is minimized. However, as to determining the best
variation of this technique, there is no consensus of results. Certain variations proved better for certain
mappings. As stated previously, these charts show the best result for the scheduling variation. These results
are not necessarily from the same mapping. Furthermore, only three or four mappings were used and there
are many more mappings which are possible.


Figure 4.7 is a plot showing the effects of contention versus no contention for FCFS. It can be easily
seen that contention is a major consideration, except at very low communication costs.


Figure  4.7.  FCFS  Contention  versus  No  Contention 
Figure 4.8 is a plot showing the effects of contention versus no contention for RC. Note that although
contention still affects the throughput, the effects are much reduced compared with FCFS. As with the
previous charts, the best results from RC are plotted.


Figure  4.8.  RC  Contention  versus  No  Contention 


To demonstrate the improvements, Figure 4.9 is a plot comparing the contention-free case and the
contention case for both FCFS and RC. The 'Throughput Decrease' is the difference between the contention
free and contention case divided by the throughput of the contention-free case for the given number of
processors and communication / computation ratio, and converted to a percentage. This percentage represents
the degradation caused by adding memory contention to the model, with a higher figure representing higher
degradation. As expected, it is seen that as the communication / computation ratio increases, the degradation
due to contention also increases. The number of processors plays only a small part in the ratio. An important


Figure  4.9.  Throughput  Decrease  Due  to  Contention  for  FCFS  and  RC 
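
The 'Throughput Decrease' plotted in Figure 4.9 follows directly from the definition given in the text. A minimal Python sketch is shown here; the function name and the sample throughput values are assumptions chosen for illustration, not measured results.

```python
def throughput_decrease(contention_free, with_contention):
    """Percentage of throughput lost when memory contention is modeled.

    Computed as (contention-free - contention) / contention-free * 100.
    """
    if contention_free <= 0:
        raise ValueError("contention-free throughput must be positive")
    return 100.0 * (contention_free - with_contention) / contention_free


# Hypothetical throughputs: 20 without contention, 15 with contention.
example = throughput_decrease(20.0, 15.0)
print(example)  # 25.0
```

A value of zero means contention had no effect on throughput, and larger values mean greater degradation.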


B.  TESTS  ON  AN  ACTUAL  APPLICATION  GRAPH 

The RC techniques were next applied to an actual application graph. The graph chosen was the
'Active Sonobuoy' graph provided by AT&T for the ECOS simulator of the EMSP system [Ref. 13], and
modified to fit the described system model. As with the test graph, the node setup and breakdown latencies
are zero, the node instruction size is zero, and the scheduler latency is zero. The produce amount, consume
amount, write amount, read amount, and threshold amount are the same for a given queue, with the capacity
eight times this quantity. The number of memory modules is equal to the number of processors, with all nodes
mapped to a processor assigned to the same memory and queues assigned to the memory of the sink node.
The simulator is set to determine the maximum throughput of the system.

Figure 4.10 shows the active sonobuoy graph. The node execution times and queue quantities are given
at the bottom for each 'level' of the graph, as all nodes and queues on each level are the same. The exception
is the final 'level' of nodes, where the execution time and the queue quantity for all queues into the node
are shown below each node, as the quantities differ. Note this graph provides for a high degree of parallel
execution.


Figure  4.10.  Active  Sonobuoy  Graph 

Only one mapping on each of three different processor arrangements (four, eight, and thirteen) was
tested. Yet in that small test sample, the results for this graph generally mirror the results for the test graph.
With low communication costs, FCFS yields good results and there is no gain with RC. However, as
communication costs increase, RC can yield increasing improvements. Once again, there is no concurrence
as to which variation of RC will consistently yield the best results. For exact results, see [Ref. 11].

C. ADDITIONAL RESULTS

In both of the graphs tested, another result viewed is the coefficient of variation. This is a measure of
the regularity of completion, or response time, of graph instances. The lower this number, the closer the
response times of all the measured graph instances to the average response time. With both graphs, the
coefficient of variation for RC is consistently less than FCFS. With some mappings, it is possible to reduce
the coefficient of variation to zero. However, it must be noted that although RC is an improvement over
FCFS, the results for FCFS were low to begin with.
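
For reference, the coefficient of variation is the standard deviation of the response times divided by their mean. The Python sketch below illustrates the calculation. The use of the population standard deviation is an assumption, since the text does not say which form the simulator reports, and the sample response times are invented.

```python
import statistics


def coefficient_of_variation(response_times):
    """Standard deviation of the response times divided by their mean.

    Assumption: the population standard deviation is used.
    """
    mean = statistics.mean(response_times)
    if mean == 0:
        raise ValueError("mean response time must be nonzero")
    return statistics.pstdev(response_times) / mean


# Perfectly regular completion: every graph instance takes the same time.
print(coefficient_of_variation([5.0, 5.0, 5.0]))  # 0.0
# Irregular completion gives a positive value.
print(coefficient_of_variation([2.0, 4.0]))
```

A coefficient of zero, as reported for some RC mappings, means every measured graph instance had exactly the same response time.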
V. CONCLUSION


This thesis provides a model for a Large Grain Data Flow (LGDF) computer system. This system
utilizes two-part processors, where one part handles communications and the other handles execution. The
applications running on this computer system are modeled as data flow graphs consisting of nodes and
queues.

A scheduling technique known as the Revolving Cylinder (RC) is described, with four variations. In
tests versus simple First-Come-First-Serve (FCFS) scheduling, it is shown that RC can lead to increased
throughput, especially as communication costs increase. However, it is seen that selecting the appropriate
mapping is not a simple task, and a good mapping for one communication cost is not necessarily a good
mapping for another communication cost. It is also shown that none of the variations of RC is consistently
better than any other variation, and that their relative performance depends on the mapping.

A.  EXPANDED  TESTING 

In this research, the purpose was to generate baseline results which allow for further expansion. Many
additional tests must be conducted to fully analyze the effectiveness of the RC technique. Several important
issues must be studied.

For nodes, in all tests, the instruction size was zero, so there is no memory contention associated
with retrieval of the instructions from memory. The input and output nodes have no bearing on processing
with the execution time set to zero.

For queues, in all cases the produce/consume, read/write, and threshold amounts were constant.
Varying these quantities could have a major impact on graph execution.

All latencies (node setup and breakdown, and scheduler latency) were zero. This reduces the
communication overhead.

All tests were made with the number of memory modules equal to the number of arithmetic processors.
Tests need to be made with varying numbers of memory modules to fully analyze the effects of memory
contention.

All tests were based on maximum graph throughput. Tests need to be completed with various graph
activation rates.

B. FUTURE RESEARCH


The primary area for future research regarding RC is mapping. The results of this
paper show that a mapping for RC can be found which improves performance over FCFS. However, there is
no easy way of finding this mapping given the many variables involved. Accurate characterization
of the cylinder mapping is necessary to develop a metric for a good mapping. This would imply establishing
a correlation between a given mapping and its run-time performance.